INN Hotels Project¶

Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows, typically because of a change of plans, scheduling conflicts, and similar reasons. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is undesirable and potentially revenue-diminishing for hotels. Such losses are particularly high for last-minute cancellations.

New technologies, such as online booking channels, have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

  • Loss of resources (revenue) when the hotel cannot resell the room.
  • Additional distribution-channel costs, such as increased commissions or paying for publicity to help resell these rooms.
  • Lowering prices at the last minute so the hotel can resell the room, which reduces the profit margin.
  • Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation, in euros; room prices are dynamic
  • no_of_special_requests: Total number of special requests made by the customer (e.g., high floor, view from the room, etc.)
  • booking_status: Flag indicating if the booking was canceled or not.

Problem Description¶

In today's hospitality industry, the prevalence of booking cancellations poses significant challenges for hotels, impacting revenue, operational efficiency, and customer satisfaction. INN Hotels Group, a prominent chain of hotels in Portugal, is grappling with the detrimental effects of high cancellation rates.

The primary objective is to develop a Machine Learning (ML) solution capable of accurately predicting booking cancellations in advance. This predictive model will empower INN Hotels Group to anticipate and proactively address potential cancellations, thereby minimizing revenue loss, optimizing resource allocation, and enhancing overall operational efficiency. It will also allow INN Hotels Group to institute new, profitable policies on cancellations and refunds.

Importing necessary libraries and data¶

In [450]:
# Installing the libraries with the specified version.
#!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

Load libraries and packages¶

In [604]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

import warnings
warnings.filterwarnings("ignore")

from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)

Initialize some basic Pandas configurations¶

In [605]:
# removing the limit for the number of displayed columns
pd.set_option("display.max_columns", None) # To set column limits replace None with a number
# setting the limit for the number of displayed rows
pd.set_option("display.max_rows", None) # To set row limits replace None with a number
# setting the precision of floating point numbers to 6 decimal places
pd.set_option("display.float_format", lambda x: "%.6f" % x)

Load useful visualization functions¶

In [606]:
# Provided by GreatLearning
# function to create histogram and boxplot; both are aligned by mean
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # For histogram; falls back to seaborn's default of bins="auto" when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [607]:
# Provided by GreatLearning
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate: horizontal center of the bar
        y = p.get_height()  # y-coordinate: top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [608]:
# Provided by GreatLearning
# function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()
In [609]:
# Provided by GreatLearning
# Display a stacked barplot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [610]:
# Purpose: Create Boxplot for multiple variables (x being a categorical value)
#
# Inputs:
#
#     in_data: DataFrame object containing rows and columns of data
#     x_feature: str representing the column name for the x-axis (categorical data)
#     y_feature: str representing the column name for the y-axis
#
def multi_boxplot(in_data, x_feature, y_feature):

    # Only proceed if the data is a DataFrame and both features are single column name strings
    if isinstance(in_data, pd.DataFrame) and type(x_feature) == str and type(y_feature) == str:

        # visualizing the relationship between the two features
        plt.figure(figsize=(12, 5))
        sns.boxplot(data=in_data, x=x_feature, y=y_feature, showmeans=True)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        plt.xticks(rotation='vertical')
        plt.xlabel(x_feature, fontsize=15)
        plt.ylabel(y_feature, fontsize=15);
        
        plt.show() 

Load model related functions¶

In [611]:
#Outlier detection
def outlier_detection(data):
    """
    Display a grid of box plots for each numeric feature, showing the outlier data

    data: dataframe
    
    """
    
    # outlier detection using boxplot
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
    # dropping booking_status
    numeric_columns.remove("booking_status")

    plt.figure(figsize=(15, 12))

    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)

    plt.show()
In [612]:
# Purpose: To treat outliers by clipping them to the lower and upper whisker
#
# Inputs:
#     df: Dataframe
#     col: Feature that has outliers to treat
#
# Note: This procedure is being utilized from GreatLearning; Week 4 (Hands_on_Notebook_ExploratoryDataAnalysis)
def treat_outliers(df, col):
    """
    treats outliers in a variable
    col: str, name of the numerical variable
    df: dataframe
    col: name of the column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1                # Inter Quantile Range (75th perentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR

    # all the values smaller than lower_whisker will be assigned the value of lower_whisker
    # all the values greater than upper_whisker will be assigned the value of upper_whisker
    # the assignment will be done by using the clip function of NumPy
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)

    return df
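As a quick sanity check, the clipping behavior can be sketched on a small hypothetical frame (the `toy` data below is illustrative only, not from the hotel data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one extreme value (illustrative only)
toy = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 100.0]})

Q1, Q3 = toy["price"].quantile([0.25, 0.75])      # Q1 = 11.0, Q3 = 13.0
IQR = Q3 - Q1                                     # 2.0
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR     # whiskers at 8.0 and 16.0

toy["price"] = np.clip(toy["price"], lower, upper)
print(toy["price"].max())  # 16.0 -- the 100.0 outlier was clipped to the upper whisker
```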
In [613]:
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors) > threshold
    # converting the resulting booleans to 0/1 class labels
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
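The threshold step in the function above can be illustrated on a few hypothetical predicted probabilities: comparing against the threshold yields booleans, which become 0/1 class labels.

```python
import numpy as np

# Hypothetical predicted probabilities (illustrative only)
probs = np.array([0.10, 0.45, 0.55, 0.90])

# comparison yields booleans; casting (equivalent to rounding them) yields 0/1 labels
pred = (probs > 0.5).astype(int)
print(pred.tolist())  # [0, 0, 1, 1]
```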
In [614]:
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [615]:
# Provided by GreatLearning
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [616]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [617]:
# Provided by GreatLearning
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
In [618]:
# Provided by GreatLearning
# defining a function to plot the precision vs recall vs threshold
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

Read in the hotel data and make a copy¶

In [619]:
# Import the data set
original_data = pd.read_csv("./INNHotelsGroup.csv")

#Make a copy of the data
data = original_data.copy()

Quick check to ensure data is read in properly¶

In [620]:
# Verify the data file was read correctly by displaying the first five rows.
data.head(5)
Out[620]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.000000 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.680000 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.000000 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.000000 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.500000 0 Canceled
In [621]:
# Verify the entire data file was read correctly by displaying the last five rows.
data.tail(5)
Out[621]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
36270 INN36271 3 0 2 6 Meal Plan 1 0 Room_Type 4 85 2018 8 3 Online 0 0 0 167.800000 1 Not_Canceled
36271 INN36272 2 0 1 3 Meal Plan 1 0 Room_Type 1 228 2018 10 17 Online 0 0 0 90.950000 2 Canceled
36272 INN36273 2 0 2 6 Meal Plan 1 0 Room_Type 1 148 2018 7 1 Online 0 0 0 98.390000 2 Not_Canceled
36273 INN36274 2 0 0 3 Not Selected 0 Room_Type 1 63 2018 4 21 Online 0 0 0 94.500000 0 Canceled
36274 INN36275 2 0 1 2 Meal Plan 1 0 Room_Type 1 207 2018 12 30 Offline 0 0 0 161.670000 0 Not_Canceled

Data Overview¶

  • Observations
  • Sanity checks
In [622]:
#Check the size of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} features.")
There are 36275 rows and 19 features.
In [623]:
#Check overall information on the features
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB

Observations¶

  • Booking_ID, type_of_meal_plan, market_segment_type, and booking_status are object types.
  • avg_price_per_room is of type float64.
  • The remainder of the fields are of type int64.
  • There are no features missing data.
In [624]:
#Show the statistical summary of the data
data.describe(include='all').T
Out[624]:
count unique top freq mean std min 25% 50% 75% max
Booking_ID 36275 36275 INN00001 1 NaN NaN NaN NaN NaN NaN NaN
no_of_adults 36275.000000 NaN NaN NaN 1.844962 0.518715 0.000000 2.000000 2.000000 2.000000 4.000000
no_of_children 36275.000000 NaN NaN NaN 0.105279 0.402648 0.000000 0.000000 0.000000 0.000000 10.000000
no_of_weekend_nights 36275.000000 NaN NaN NaN 0.810724 0.870644 0.000000 0.000000 1.000000 2.000000 7.000000
no_of_week_nights 36275.000000 NaN NaN NaN 2.204300 1.410905 0.000000 1.000000 2.000000 3.000000 17.000000
type_of_meal_plan 36275 4 Meal Plan 1 27835 NaN NaN NaN NaN NaN NaN NaN
required_car_parking_space 36275.000000 NaN NaN NaN 0.030986 0.173281 0.000000 0.000000 0.000000 0.000000 1.000000
room_type_reserved 36275 7 Room_Type 1 28130 NaN NaN NaN NaN NaN NaN NaN
lead_time 36275.000000 NaN NaN NaN 85.232557 85.930817 0.000000 17.000000 57.000000 126.000000 443.000000
arrival_year 36275.000000 NaN NaN NaN 2017.820427 0.383836 2017.000000 2018.000000 2018.000000 2018.000000 2018.000000
arrival_month 36275.000000 NaN NaN NaN 7.423653 3.069894 1.000000 5.000000 8.000000 10.000000 12.000000
arrival_date 36275.000000 NaN NaN NaN 15.596995 8.740447 1.000000 8.000000 16.000000 23.000000 31.000000
market_segment_type 36275 5 Online 23214 NaN NaN NaN NaN NaN NaN NaN
repeated_guest 36275.000000 NaN NaN NaN 0.025637 0.158053 0.000000 0.000000 0.000000 0.000000 1.000000
no_of_previous_cancellations 36275.000000 NaN NaN NaN 0.023349 0.368331 0.000000 0.000000 0.000000 0.000000 13.000000
no_of_previous_bookings_not_canceled 36275.000000 NaN NaN NaN 0.153411 1.754171 0.000000 0.000000 0.000000 0.000000 58.000000
avg_price_per_room 36275.000000 NaN NaN NaN 103.423539 35.089424 0.000000 80.300000 99.450000 120.000000 540.000000
no_of_special_requests 36275.000000 NaN NaN NaN 0.619655 0.786236 0.000000 0.000000 0.000000 1.000000 5.000000
booking_status 36275 2 Not_Canceled 24390 NaN NaN NaN NaN NaN NaN NaN

Observations:¶

  • There are four unique values for type_of_meal_plan; candidate for dummy variables
  • There are seven unique values for room_type_reserved; candidate for dummy variables
  • There are five unique values for market_segment_type; candidate for dummy variables
  • There are two unique values for booking_status; this is the dependent variable (Y value).
  • The average no_of_adults is 1.84, while the average no_of_children is only 0.11.
    • This indicates that the hotel may cater mostly to adults.
  • The average lead_time is 85.23 days.
  • The mean avg_price_per_room is 103.42 euros, while the maximum is 540 euros.
  • The most common room type reserved is Room_Type 1.

Check for Missing Values¶

In [625]:
# Check for missing values.
data.isnull().sum()
Out[625]:
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

Observations¶

  • No missing values.

Check for the number of unique values¶

In [626]:
data.nunique()
Out[626]:
Booking_ID                              36275
no_of_adults                                5
no_of_children                              6
no_of_weekend_nights                        8
no_of_week_nights                          18
type_of_meal_plan                           4
required_car_parking_space                  2
room_type_reserved                          7
lead_time                                 352
arrival_year                                2
arrival_month                              12
arrival_date                               31
market_segment_type                         5
repeated_guest                              2
no_of_previous_cancellations                9
no_of_previous_bookings_not_canceled       59
avg_price_per_room                       3930
no_of_special_requests                      6
booking_status                              2
dtype: int64

Observations¶

  • Booking_ID has 36275 unique values (one per row) and will need to be dropped prior to building the regression and decision tree models.
  • arrival_year has only 2 unique values, so this data set covers only two years of data.
  • no_of_previous_bookings_not_canceled has 59 unique values, indicating repeat customers and the potential for a future loyalty program (if not already established).
  • booking_status has two unique values, corresponding to canceled and not canceled.
    • It is unlikely this column has bad data, which is good since it is the dependent variable (Y data).

Check for duplicate values¶

In [627]:
# Check for duplicate values in the "Booking_ID" column
duplicate_booking_ids = data[data.duplicated(subset=['Booking_ID'], keep=False)]

# If there are duplicate booking IDs, they will need to be removed.
if duplicate_booking_ids.empty:
    print("No duplicate Booking_IDs found.")
else:
    print("Duplicate Booking_IDs found:")
    print(duplicate_booking_ids)
No duplicate Booking_IDs found.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Drop Booking_ID since it is a unique identifier and will not be useful for modeling¶

In [628]:
data.drop(columns=["Booking_ID"],axis=1,inplace=True)

Convert the Booking_status values of Canceled and Not_Canceled to 1s and 0s.¶

  • Since the model is to predict whether a customer will cancel, we will convert values of Canceled to 1.
In [629]:
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)
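The same mapping can be verified on a tiny hypothetical series; `Series.map` with an explicit dictionary is an equivalent alternative:

```python
import pandas as pd

# Hypothetical booking_status values (illustrative only)
toy = pd.Series(["Canceled", "Not_Canceled", "Canceled", "Not_Canceled"])

encoded = toy.apply(lambda x: 1 if x == "Canceled" else 0)
print(encoded.tolist())  # [1, 0, 1, 0]

# equivalent explicit mapping
assert encoded.equals(toy.map({"Canceled": 1, "Not_Canceled": 0}))
```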

Verify the object (categorical) features have clean, consistent values and no misspellings¶

In [630]:
# Ensure consistent values for object features
# Loop through each column
for column in data.columns:
    if data[column].dtype == 'object':  # Check if column dtype is object (categorical)
        unique_values = data[column].unique()
        print(f"Unique values for column '{column}':")
        for value in unique_values:
            print("\t * ",value)
Unique values for column 'type_of_meal_plan':
	 *  Meal Plan 1
	 *  Not Selected
	 *  Meal Plan 2
	 *  Meal Plan 3
Unique values for column 'room_type_reserved':
	 *  Room_Type 1
	 *  Room_Type 4
	 *  Room_Type 2
	 *  Room_Type 6
	 *  Room_Type 5
	 *  Room_Type 7
	 *  Room_Type 3
Unique values for column 'market_segment_type':
	 *  Offline
	 *  Online
	 *  Corporate
	 *  Aviation
	 *  Complementary

EDA¶

  • It is a good idea to explore the data once again after manipulating it.
In [631]:
#Verify the column was dropped successfully
print(f"There are {data.shape[0]} rows and {data.shape[1]} features.")
data.head(5)
There are 36275 rows and 18 features.
Out[631]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.000000 0 0
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.680000 1 0
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.000000 0 1
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.000000 0 1
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.500000 0 1
In [632]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36275 non-null  int64  
 1   no_of_children                        36275 non-null  int64  
 2   no_of_weekend_nights                  36275 non-null  int64  
 3   no_of_week_nights                     36275 non-null  int64  
 4   type_of_meal_plan                     36275 non-null  object 
 5   required_car_parking_space            36275 non-null  int64  
 6   room_type_reserved                    36275 non-null  object 
 7   lead_time                             36275 non-null  int64  
 8   arrival_year                          36275 non-null  int64  
 9   arrival_month                         36275 non-null  int64  
 10  arrival_date                          36275 non-null  int64  
 11  market_segment_type                   36275 non-null  object 
 12  repeated_guest                        36275 non-null  int64  
 13  no_of_previous_cancellations          36275 non-null  int64  
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64  
 17  booking_status                        36275 non-null  int64  
dtypes: float64(1), int64(14), object(3)
memory usage: 5.0+ MB
In [633]:
# Show the statistical summary
data.describe(include='all').T
Out[633]:
count unique top freq mean std min 25% 50% 75% max
no_of_adults 36275.000000 NaN NaN NaN 1.844962 0.518715 0.000000 2.000000 2.000000 2.000000 4.000000
no_of_children 36275.000000 NaN NaN NaN 0.105279 0.402648 0.000000 0.000000 0.000000 0.000000 10.000000
no_of_weekend_nights 36275.000000 NaN NaN NaN 0.810724 0.870644 0.000000 0.000000 1.000000 2.000000 7.000000
no_of_week_nights 36275.000000 NaN NaN NaN 2.204300 1.410905 0.000000 1.000000 2.000000 3.000000 17.000000
type_of_meal_plan 36275 4 Meal Plan 1 27835 NaN NaN NaN NaN NaN NaN NaN
required_car_parking_space 36275.000000 NaN NaN NaN 0.030986 0.173281 0.000000 0.000000 0.000000 0.000000 1.000000
room_type_reserved 36275 7 Room_Type 1 28130 NaN NaN NaN NaN NaN NaN NaN
lead_time 36275.000000 NaN NaN NaN 85.232557 85.930817 0.000000 17.000000 57.000000 126.000000 443.000000
arrival_year 36275.000000 NaN NaN NaN 2017.820427 0.383836 2017.000000 2018.000000 2018.000000 2018.000000 2018.000000
arrival_month 36275.000000 NaN NaN NaN 7.423653 3.069894 1.000000 5.000000 8.000000 10.000000 12.000000
arrival_date 36275.000000 NaN NaN NaN 15.596995 8.740447 1.000000 8.000000 16.000000 23.000000 31.000000
market_segment_type 36275 5 Online 23214 NaN NaN NaN NaN NaN NaN NaN
repeated_guest 36275.000000 NaN NaN NaN 0.025637 0.158053 0.000000 0.000000 0.000000 0.000000 1.000000
no_of_previous_cancellations 36275.000000 NaN NaN NaN 0.023349 0.368331 0.000000 0.000000 0.000000 0.000000 13.000000
no_of_previous_bookings_not_canceled 36275.000000 NaN NaN NaN 0.153411 1.754171 0.000000 0.000000 0.000000 0.000000 58.000000
avg_price_per_room 36275.000000 NaN NaN NaN 103.423539 35.089424 0.000000 80.300000 99.450000 120.000000 540.000000
no_of_special_requests 36275.000000 NaN NaN NaN 0.619655 0.786236 0.000000 0.000000 0.000000 1.000000 5.000000
booking_status 36275.000000 NaN NaN NaN 0.327636 0.469358 0.000000 0.000000 0.000000 1.000000 1.000000

Univariate Analysis¶

In [634]:
#Show listing of all the columns
data.columns
Out[634]:
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',
       'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month',
       'arrival_date', 'market_segment_type', 'repeated_guest',
       'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
       'avg_price_per_room', 'no_of_special_requests', 'booking_status'],
      dtype='object')

Observations on no_of_adults¶

In [635]:
labeled_barplot(data, feature="no_of_adults", perc=True)

Observation¶

  • 72% of bookings are made for two adults, while single-adult bookings account for 21.2%.

Observations on no_of_children¶

In [636]:
labeled_barplot(data, feature="no_of_children", perc=True)

Observation¶

  • 92.6% of the bookings include no children.

Observations on no_of_weekend_nights¶

In [637]:
labeled_barplot(data, feature="no_of_weekend_nights", perc=True)

Observations¶

  • 46.5% of bookings include only one (1) weekend night.
  • 52.6% of bookings include one or two weekend nights.

Observations for no_of_week_nights¶

In [638]:
labeled_barplot(data, feature="no_of_week_nights", perc=True)

Observations¶

  • 31.5% of the bookings were for two week nights
  • 26.5% of the bookings were for one week night
  • 21.6% of the bookings were for three week nights

Observations on type_of_meal_plan¶

In [639]:
labeled_barplot(data, feature="type_of_meal_plan", perc=True)

Observation¶

  • 76.7% of the meal plans selected are Meal Plan 1.
  • 14.1% of the time a meal plan is not selected.

Observation on required_car_parking_space¶

In [640]:
labeled_barplot(data, feature="required_car_parking_space", perc=True)

Observation¶

  • Most bookings (96.9%) do not require a parking space.

Observations on room_type_reserved¶

In [641]:
labeled_barplot(data, feature="room_type_reserved", perc=True)

Observation¶

  • Room_Type 1 is selected 77.5% of the time.
  • Room_Type 4 is selected 16.7% of the time.

Observation on arrival_year¶

In [642]:
labeled_barplot(data, feature="arrival_year", perc=True)
In [643]:
#Let's investigate 2017 a bit further
data[data['arrival_year']==2017]["arrival_month"].value_counts()
Out[643]:
10    1913
9     1649
8     1014
12     928
11     647
7      363
Name: arrival_month, dtype: int64
In [644]:
#Let's investigate 2018 a bit further
data[data['arrival_year']==2018]["arrival_month"].value_counts()
Out[644]:
10    3404
6     3203
9     2962
8     2799
4     2736
5     2598
7     2557
3     2358
11    2333
12    2093
2     1704
1     1014
Name: arrival_month, dtype: int64

Observations¶

  • There are only two years' worth of data (2017-2018).
  • 82% of the booking data is from 2018.
  • The 2017 data covers only the second half of the year.
    • This may indicate that the hotel opened in the second half of 2017 and took several months to ramp up.
  • The 2018 data shows a general upward trend in monthly bookings, peaking in October.

Observations on arrival_date¶

In [645]:
labeled_barplot(data, feature="arrival_date", perc=True)

Observations¶

  • No significant observations

Observations on market_segment_type¶

In [646]:
labeled_barplot(data, feature="market_segment_type", perc=True)

Observations¶

  • Approximately two-thirds of the bookings are made online.
  • 1.1% of the bookings are Complementary and thus have a room price of 0.
  • A very small percentage (0.3%) come from the Aviation segment.
  • Only 5.6% are Corporate bookings.

Observations on repeated_guest¶

In [647]:
labeled_barplot(data, feature="repeated_guest", perc=True)

Observations¶

  • Most customers, 97.4%, are not repeat customers.

Observations on no_of_special_requests¶

In [648]:
labeled_barplot(data, feature="no_of_special_requests", perc=True)

Observation¶

  • 54.5% of bookings had no special requests.
  • 31.4% of bookings had one (1) special request.

Observations on booking_status¶

In [649]:
labeled_barplot(data, feature="booking_status", perc=True)

Observations¶

  • The hotel is experiencing approximately a one third cancellation rate.

Observations on lead_time¶

In [650]:
histogram_boxplot(data, feature="lead_time")

Observations¶

  • The data is right-skewed.
  • lead_time has quite a few outliers beyond the upper whisker.

Observations on no_of_previous_cancellations¶

In [651]:
histogram_boxplot(data, feature="no_of_previous_cancellations")

Observations¶

  • No significant observations from this plot.

Observations on no_of_previous_bookings_not_canceled¶

In [652]:
histogram_boxplot(data, feature="no_of_previous_bookings_not_canceled")

Observations¶

  • No significant observations

Observations on avg_price_per_room¶

In [653]:
histogram_boxplot(data, feature="avg_price_per_room",kde=True)

Observations¶

  • The median avg_price_per_room is approximately 100 euros.
  • The distribution is close to normal but slightly right-skewed.

Bivariate Analysis¶

In [654]:
# Display the numeric fields in a heatmap to determine if there are any correlations between features
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Observations¶

  • There are no strong correlations between the numerical features.
  • There is a slight positive correlation between no_of_previous_bookings_not_canceled and repeated_guest.
  • This is also a good indication that only minimal adjustments may be needed to remove multicollinearity between features.
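
The strongest pairs can also be read off numerically instead of visually from the heatmap. A minimal sketch on a synthetic frame (toy columns mimicking the correlated pair noted above; not the real data):

```python
import numpy as np
import pandas as pd

# Toy data: repeated guests accumulate prior non-cancelled bookings,
# while lead_time is independent of both.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"repeated_guest": rng.integers(0, 2, 200)})
toy["no_of_previous_bookings_not_canceled"] = (
    toy["repeated_guest"] * rng.integers(0, 5, 200)
)
toy["lead_time"] = rng.integers(0, 400, 200)

corr = toy.corr()
# Keep each unordered pair once (upper triangle, diagonal excluded)
# and rank by absolute correlation strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs.abs().sort_values(ascending=False))
```

On the real frame, the same three lines applied to `data[cols_list].corr()` would rank every feature pair by correlation strength.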

Determine how avg_price_per_room may impact booking_status.¶

In [655]:
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")

Observations¶

  • The median avg_price_per_room is slightly higher for those that have cancelled than for those that have not.

Let's determine if lead_time may impact booking_status¶

In [656]:
distribution_plot_wrt_target(data, "lead_time", "booking_status")

Observations¶

  • Larger lead times may influence cancellations.
    • Cancellations have a median lead_time of approximately 125 days.
    • Non-cancellations have a median lead_time of less than 50 days.
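
The two medians can be confirmed with a groupby rather than read off the plot. A sketch on a toy frame (illustrative values, not the real data):

```python
import pandas as pd

# Toy bookings: status 1 = cancelled, given longer lead times here
toy = pd.DataFrame({
    "booking_status": [0, 0, 0, 0, 1, 1, 1],
    "lead_time": [5, 20, 45, 60, 100, 125, 160],
})

# Median lead time per outcome; on the real frame this would be
# data.groupby("booking_status")["lead_time"].median()
medians = toy.groupby("booking_status")["lead_time"].median()
print(medians)
```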

Let's determine if the type_of_meal_plan impacts booking_status¶

In [657]:
stacked_barplot(data, "type_of_meal_plan", "booking_status")
booking_status         0      1    All
type_of_meal_plan                     
All                24390  11885  36275
Meal Plan 1        19156   8679  27835
Not Selected        3431   1699   5130
Meal Plan 2         1799   1506   3305
Meal Plan 3            4      1      5
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • Bookings with Meal Plan 2 have the highest percentage of cancellations.
  • Meal Plan 3 has the lowest percentage of cancellations. However, the number of bookings for this meal plan is too low to be significant.
  • Recommendation: The hotel may want to consider promoting Meal Plan 1 to help decrease cancellations, or improve Meal Plan 2 to more closely mimic the success of Meal Plan 1.

Let's determine if the number of traveling people impact booking_status.¶

In [658]:
# Create a new field for total guests by adding the number of adults and the number of children traveling.
total_guests_data = data.copy()

# Add up the total number of guests traveling for each booking
total_guests_data["no_of_guests"] = (
    total_guests_data["no_of_adults"] + total_guests_data["no_of_children"]
)

# Display the stacked barplot for no_of_guests vs booking_status
stacked_barplot(total_guests_data, "no_of_guests", "booking_status")
booking_status      0      1    All
no_of_guests                       
All             24390  11885  36275
2               15662   8280  23942
1                5743   1809   7552
3                2459   1392   3851
4                 514    398    912
5                  10      5     15
11                  0      1      1
10                  1      0      1
12                  1      0      1
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • There is insignificant data for bookings with more than five guests (these bars are disregarded).
  • Among bookings with four (4) or fewer guests, the likelihood of cancellation tends to increase with the number of guests.

Let's determine if the number of total days impacts the booking status¶

In [659]:
# Create a new field for the total number of nights by adding the week nights and the weekend nights.

#Make a temporary copy of the data
total_nights_data = data.copy()


# Add up the total number of nights for each booking
total_nights_data["total_nights"] = (
    total_nights_data["no_of_week_nights"] + total_nights_data["no_of_weekend_nights"]
)

# View the total counts for each total night value to help reduce the insignificant data values.
total_nights_data["total_nights"].value_counts()
Out[659]:
3     10052
2      8472
1      6604
4      5893
5      2589
6      1031
7       973
8       179
9       111
10      109
0        78
11       39
14       32
15       31
12       24
13       18
20       11
19        6
16        6
17        5
21        4
18        3
23        2
22        2
24        1
Name: total_nights, dtype: int64
In [660]:
# Let's remove some of the data containing insignificant counts for easier analysis on the stacked barplot.
total_nights_data = total_nights_data[total_nights_data["total_nights"] <= 15]

# Review the info again
total_nights_data.info()                                       
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36235 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36235 non-null  int64  
 1   no_of_children                        36235 non-null  int64  
 2   no_of_weekend_nights                  36235 non-null  int64  
 3   no_of_week_nights                     36235 non-null  int64  
 4   type_of_meal_plan                     36235 non-null  object 
 5   required_car_parking_space            36235 non-null  int64  
 6   room_type_reserved                    36235 non-null  object 
 7   lead_time                             36235 non-null  int64  
 8   arrival_year                          36235 non-null  int64  
 9   arrival_month                         36235 non-null  int64  
 10  arrival_date                          36235 non-null  int64  
 11  market_segment_type                   36235 non-null  object 
 12  repeated_guest                        36235 non-null  int64  
 13  no_of_previous_cancellations          36235 non-null  int64  
 14  no_of_previous_bookings_not_canceled  36235 non-null  int64  
 15  avg_price_per_room                    36235 non-null  float64
 16  no_of_special_requests                36235 non-null  int64  
 17  booking_status                        36235 non-null  int64  
 18  total_nights                          36235 non-null  int64  
dtypes: float64(1), int64(15), object(3)
memory usage: 5.5+ MB
In [661]:
# Display the stacked barplot for total_nights vs booking_status
stacked_barplot(total_nights_data, "total_nights", "booking_status")
booking_status      0      1    All
total_nights                       
All             24382  11853  36235
3                6466   3586  10052
2                5573   2899   8472
4                3952   1941   5893
1                5138   1466   6604
5                1766    823   2589
6                 566    465   1031
7                 590    383    973
8                 100     79    179
10                 51     58    109
9                  58     53    111
14                  5     27     32
15                  5     26     31
11                 24     15     39
12                  9     15     24
13                  3     15     18
0                  76      2     78
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • As the total number of nights increases, so does the chance of cancelling.
  • Bookings of five (5) or fewer nights have better non-cancellation rates than longer stays.

Let's investigate further whether the room_type_reserved has an impact on booking_status¶

In [662]:
# Display the stacked barplot for room_type_reserved vs booking_status
stacked_barplot(total_guests_data, "room_type_reserved", "booking_status")
booking_status          0      1    All
room_type_reserved                     
All                 24390  11885  36275
Room_Type 1         19058   9072  28130
Room_Type 4          3988   2069   6057
Room_Type 6           560    406    966
Room_Type 2           464    228    692
Room_Type 5           193     72    265
Room_Type 7           122     36    158
Room_Type 3             5      2      7
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • It appears that the room type booked may have an impact on booking status.
  • Recommendation: Gather additional data on the various room types (price, smoking preference, accessibility, etc.).
    • This additional data may give the hotel insights on how to improve rooms and reduce cancellations.

Determine if required_car_parking_space impacts booking_status¶

In [663]:
# Display the stacked barplot for required_car_parking_space vs booking_status
stacked_barplot(total_guests_data, "required_car_parking_space", "booking_status")
booking_status                  0      1    All
required_car_parking_space                     
All                         24390  11885  36275
0                           23380  11771  35151
1                            1010    114   1124
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • Guests who required a car parking space are less likely to cancel.
  • Unfortunately, only a small percentage of guests require a parking space.

Observations¶

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

1. What are the busiest months in the hotel?¶

In [664]:
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
print(monthly_data)
print(monthly_data.values)

# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
    {"Month": list(monthly_data.index), "Number of Bookings": list(monthly_data.values)}
)

# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Number of Bookings")
plt.show()
arrival_month
1     1014
2     1704
3     2358
4     2736
5     2598
6     3203
7     2920
8     3813
9     4611
10    5317
11    2980
12    3021
Name: booking_status, dtype: int64
[1014 1704 2358 2736 2598 3203 2920 3813 4611 5317 2980 3021]

Observations¶

  • The busiest months of the year are June, August, September, and October.

2. Which market segment do most of the guests come from?¶

In [665]:
# Let's determine the market segment that most of the guests come from
labeled_barplot(data, feature="market_segment_type", perc=True)

# Let's also plot market_segment_type vs booking_status
stacked_barplot(data, "market_segment_type", "booking_status")
booking_status           0      1    All
market_segment_type                     
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • Most of the guest bookings come from the Online market segment.
  • The Online segment also accounts for most of the cancellations.
    • This is likely due to the ease of cancelling online.
  • The Offline and Aviation segments have approximately the same ratio of cancellations.

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?¶

In [666]:
# Display multi-boxplots by market_segment_type
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="market_segment_type", y="avg_price_per_room")
plt.show()
In [667]:
# Grouping the data on market_segment_type and then take the median of avg_price_per_room
market_segment_data = data.groupby(["market_segment_type"])["avg_price_per_room"].median()
market_segment_data
Out[667]:
market_segment_type
Aviation         95.000000
Complementary     0.000000
Corporate        79.000000
Offline          90.000000
Online          107.100000
Name: avg_price_per_room, dtype: float64

Observations¶

  • The Online market segment has the highest median avg_price_per_room, at about 107 euros.
  • The Complementary market segment has the lowest median avg_price_per_room, which makes sense since those rooms are free.

4. What percentage of bookings are canceled?¶

In [668]:
# Determine the percentage of bookings that are cancelled
labeled_barplot(data, feature="booking_status", perc=True)

# Create a stacked barplot of arrival_months vs booking_status
stacked_barplot(data,"arrival_month","booking_status")
booking_status      0      1    All
arrival_month                      
All             24390  11885  36275
10               3437   1880   5317
9                3073   1538   4611
8                2325   1488   3813
7                1606   1314   2920
6                1912   1291   3203
4                1741    995   2736
5                1650    948   2598
11               2105    875   2980
3                1658    700   2358
2                1274    430   1704
12               2619    402   3021
1                 990     24   1014
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • 32.8% of the bookings are cancelled.
  • The summer months of May-August have the highest percentages of booking cancellations.

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?¶

In [669]:
# Create a stacked barplot
stacked_barplot(data,"repeated_guest","booking_status") 

# Calculate percentage of cancellations for each repeated_guest value
cancellation_percentage = data.groupby(['repeated_guest', 'booking_status']).size().unstack(fill_value=0)
cancellation_percentage = cancellation_percentage.apply(lambda x: x / x.sum(), axis=1) * 100
print (cancellation_percentage)
booking_status      0      1    All
repeated_guest                     
All             24390  11885  36275
0               23476  11869  35345
1                 914     16    930
------------------------------------------------------------------------------------------------------------------------
booking_status         0         1
repeated_guest                    
0              66.419578 33.580422
1              98.279570  1.720430

Observations¶

  • Only 1.72% of the repeating guests cancelled.

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?¶

In [670]:
stacked_barplot(data, "no_of_special_requests", "booking_status") 
booking_status              0      1    All
no_of_special_requests                     
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         675      0    675
4                          78      0     78
5                           8      0      8
------------------------------------------------------------------------------------------------------------------------

Observations¶

  • Yes. The cancellation rate drops sharply as the number of special requests increases, and bookings with three (3) or more special requests had no cancellations at all.

Data Preprocessing¶

Outlier Detection¶

In [671]:
outlier_detection(data)

Observations¶

  • There are quite a few outliers in the data.
  • However, we will not treat them, as they are legitimate values.

Data Preparations¶

  • Encoding of booking_status has already been completed.
Creating training and test sets.¶
In [672]:
# specifying the independent and dependent variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]

# adding a constant to the independent variables
X = sm.add_constant(X)

# creating dummy variables
X = pd.get_dummies(X, drop_first=True)

# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [673]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0   0.670644
1   0.329356
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.676376
1   0.323624
Name: booking_status, dtype: float64
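
The class proportions above differ slightly between train and test (0.6706 vs 0.6764 for class 0) because the split was not stratified. Passing stratify=Y would keep them essentially identical. A minimal sketch with synthetic labels (not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 67/33 imbalance as booking_status
y = np.array([0] * 670 + [1] * 330)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y forces both splits to preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both ~0.33
```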

Model Building - Logistic Regression¶

  • We will now perform logistic regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.

  • Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.

In [674]:
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)

print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25364
Method:                           MLE   Df Model:                           27
Date:                Fri, 19 Apr 2024   Pseudo R-squ.:                  0.3293
Time:                        17:09:55   Log-Likelihood:                -10793.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -924.5923    120.817     -7.653      0.000   -1161.390    -687.795
no_of_adults                             0.1135      0.038      3.017      0.003       0.040       0.187
no_of_children                           0.1563      0.057      2.732      0.006       0.044       0.268
no_of_weekend_nights                     0.1068      0.020      5.398      0.000       0.068       0.146
no_of_week_nights                        0.0398      0.012      3.239      0.001       0.016       0.064
required_car_parking_space              -1.5939      0.138    -11.561      0.000      -1.864      -1.324
lead_time                                0.0157      0.000     58.868      0.000       0.015       0.016
arrival_year                             0.4570      0.060      7.633      0.000       0.340       0.574
arrival_month                           -0.0415      0.006     -6.418      0.000      -0.054      -0.029
arrival_date                             0.0005      0.002      0.252      0.801      -0.003       0.004
repeated_guest                          -2.3469      0.617     -3.805      0.000      -3.556      -1.138
no_of_previous_cancellations             0.2664      0.086      3.108      0.002       0.098       0.434
no_of_previous_bookings_not_canceled    -0.1727      0.153     -1.131      0.258      -0.472       0.127
avg_price_per_room                       0.0188      0.001     25.404      0.000       0.017       0.020
no_of_special_requests                  -1.4690      0.030    -48.790      0.000      -1.528      -1.410
type_of_meal_plan_Meal Plan 2            0.1768      0.067      2.654      0.008       0.046       0.307
type_of_meal_plan_Meal Plan 3           17.8379   5057.771      0.004      0.997   -9895.211    9930.887
type_of_meal_plan_Not Selected           0.2782      0.053      5.245      0.000       0.174       0.382
room_type_reserved_Room_Type 2          -0.3610      0.131     -2.761      0.006      -0.617      -0.105
room_type_reserved_Room_Type 3          -0.0009      1.310     -0.001      0.999      -2.569       2.567
room_type_reserved_Room_Type 4          -0.2821      0.053     -5.305      0.000      -0.386      -0.178
room_type_reserved_Room_Type 5          -0.7176      0.209     -3.432      0.001      -1.127      -0.308
room_type_reserved_Room_Type 6          -0.9456      0.147     -6.434      0.000      -1.234      -0.658
room_type_reserved_Room_Type 7          -1.3964      0.293     -4.767      0.000      -1.971      -0.822
market_segment_type_Complementary      -41.8798   8.42e+05  -4.98e-05      1.000   -1.65e+06    1.65e+06
market_segment_type_Corporate           -1.1935      0.266     -4.487      0.000      -1.715      -0.672
market_segment_type_Offline             -2.1955      0.255     -8.625      0.000      -2.694      -1.697
market_segment_type_Online              -0.3990      0.251     -1.588      0.112      -0.891       0.093
========================================================================================================

Observations¶

  • Negative coefficients indicate that the probability of a guest cancelling a booking decreases as the corresponding attribute value increases.

  • Positive coefficients indicate that the probability of a guest cancelling a booking increases as the corresponding attribute value increases.

  • The p-value of a variable indicates whether it is significant. At a significance level of 0.05 (5%), any variable with a p-value below 0.05 is considered significant.

  • The extremely large coefficients and standard errors for type_of_meal_plan_Meal Plan 3 and market_segment_type_Complementary reflect quasi-complete separation (for example, none of the Complementary bookings were cancelled), so those estimates are unreliable.
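
As a sketch of that significance check, a few p-values copied from the summary above can be filtered at the 0.05 level (in the notebook, lg.pvalues gives the full Series directly):

```python
import pandas as pd

# A few p-values copied from the Logit summary above
pvalues = pd.Series({
    "lead_time": 0.000,
    "avg_price_per_room": 0.000,
    "no_of_previous_bookings_not_canceled": 0.258,
    "arrival_date": 0.801,
})

# Keep only predictors significant at the 5% level
significant = pvalues[pvalues < 0.05].index.tolist()
print(significant)  # ['lead_time', 'avg_price_per_room']
```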

Model Performance Evaluation¶

  • Model can make wrong predictions as:

    1. Predicting a guest will cancel but in reality they do not cancel.

    2. Predicting a guest will not cancel but in reality they do cancel.

  • Which case is more important?

    • Both the cases are important as:

      • (False Positive): Predicting a guest will cancel when they actually do not would lead the hotel to overbook rooms, resulting in unsatisfied customers.

      • (False Negative): Predicting a guest will not cancel when they actually do would cause the hotel to lose revenue by leaving rooms unsold.

      • Therefore, both of these scenarios (Type I and Type II errors) are important and we therefore, want to minimize both.

  • How to reduce this loss?

    • We need to reduce both False Negatives (to improve Recall) and False Positives (to improve Precision)

    • f1_score should be maximized: the greater the f1_score, the higher the chances of reducing both False Negatives and False Positives and identifying both classes correctly

    • f1_score is computed as $$f1\_score = \frac{2 * Precision * Recall}{Precision + Recall}$$

Model performance evaluation¶

  • The model_performance_classification_statsmodels function will be used to check the model performance of models.
  • The confusion_matrix_statsmodels function will be used to plot confusion matrix.
In [675]:
# Convert to float; otherwise dtype issues arise when creating the confusion matrix
X_train = X_train.astype(float)

# Display the confusion matrix
confusion_matrix_statsmodels(lg, X_train, y_train)
In [676]:
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
Out[676]:
Accuracy Recall Precision F1
0 0.806041 0.634222 0.739749 0.682933

Observations¶

  • The f1_score of the model is ~0.68 and we will try to maximize it further

  • The variables used to build the model might contain multicollinearity, which will affect the p-values

    • We will have to remove multicollinearity from the data to get reliable coefficients and p-values

Checking Multicollinearity¶

  • In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.

There are different ways of detecting (or testing for) multicollinearity. One such way is using the Variance Inflation Factor (VIF).

  • Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.

  • General Rule of thumb:

    • If VIF is 1, there is no correlation between the $k$th predictor and the remaining predictor variables, and hence the variance of $\beta_k$ is not inflated at all
    • If VIF exceeds 5, we say there is moderate multicollinearity
    • If VIF is equal to or exceeds 10, it shows signs of high multicollinearity
  • The purpose of the analysis should dictate which threshold to use
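The rule of thumb above follows from how the VIF is defined: for the $k$th predictor, regress that predictor on all the remaining predictors and take the resulting $R_k^2$:

$$VIF_k = \frac{1}{1 - R_k^2}$$

When $R_k^2 = 0$ (no correlation with the other predictors), $VIF_k = 1$; as $R_k^2$ approaches 1, $VIF_k$ grows without bound.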

In [677]:
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

const                                  39468156.706004
no_of_adults                                  1.348154
no_of_children                                1.978229
no_of_weekend_nights                          1.069475
no_of_week_nights                             1.095667
required_car_parking_space                    1.039928
lead_time                                     1.394914
arrival_year                                  1.430830
arrival_month                                 1.275673
arrival_date                                  1.006738
repeated_guest                                1.783516
no_of_previous_cancellations                  1.395689
no_of_previous_bookings_not_canceled          1.651986
avg_price_per_room                            2.050421
no_of_special_requests                        1.247278
type_of_meal_plan_Meal Plan 2                 1.271851
type_of_meal_plan_Meal Plan 3                 1.025216
type_of_meal_plan_Not Selected                1.272183
room_type_reserved_Room_Type 2                1.101438
room_type_reserved_Room_Type 3                1.003302
room_type_reserved_Room_Type 4                1.361515
room_type_reserved_Room_Type 5                1.027810
room_type_reserved_Room_Type 6                1.973072
room_type_reserved_Room_Type 7                1.115123
market_segment_type_Complementary             4.500109
market_segment_type_Corporate                16.928435
market_segment_type_Offline                  64.113924
market_segment_type_Online                   71.176430
dtype: float64

Observations¶

  • Three of the four market_segment_type dummy categories have very high VIF values.
  • Let's drop the market_segment_type_Online dummy category field and re-assess the VIF values.
In [678]:
# Let's drop the market_segment_type_Online column from both the X_train and X_test data frames
col_to_drop = "market_segment_type_Online"
X_train1 = X_train.drop(columns=[col_to_drop])
X_test1 = X_test.drop(columns=[col_to_drop])
In [679]:
# Reassess the VIF values
vif_series = pd.Series(
    [variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
    index=X_train1.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

const                                  39391371.314593
no_of_adults                                  1.331784
no_of_children                                1.977350
no_of_weekend_nights                          1.069039
no_of_week_nights                             1.095118
required_car_parking_space                    1.039795
lead_time                                     1.390637
arrival_year                                  1.428376
arrival_month                                 1.274625
arrival_date                                  1.006721
repeated_guest                                1.780188
no_of_previous_cancellations                  1.395447
no_of_previous_bookings_not_canceled          1.651745
avg_price_per_room                            2.049595
no_of_special_requests                        1.242418
type_of_meal_plan_Meal Plan 2                 1.271497
type_of_meal_plan_Meal Plan 3                 1.025216
type_of_meal_plan_Not Selected                1.270387
room_type_reserved_Room_Type 2                1.101271
room_type_reserved_Room_Type 3                1.003301
room_type_reserved_Room_Type 4                1.356004
room_type_reserved_Room_Type 5                1.027810
room_type_reserved_Room_Type 6                1.972732
room_type_reserved_Room_Type 7                1.115003
market_segment_type_Complementary             1.338253
market_segment_type_Corporate                 1.527769
market_segment_type_Offline                   1.597418
dtype: float64

Observations¶

  • All features now have relatively low VIF values, indicating that multicollinearity has been resolved.
  • Let's rebuild the model and reassess the model's performance metrics.
In [680]:
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)

print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
Training performance:
Out[680]:
Accuracy Recall Precision F1
0 0.805766 0.633744 0.739294 0.682462

Observations¶

  • There was no significant change in the performance metrics after the field was dropped.
  • We will proceed with the new model, as it is slightly less complex after dropping the field.
  • Let's now address any features that have p-values > .05 (indicating they are not statistically significant at the 5% level)
In [681]:
# Let's review the summary for additional p-value analysis
print(lg1.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25365
Method:                           MLE   Df Model:                           26
Date:                Fri, 19 Apr 2024   Pseudo R-squ.:                  0.3292
Time:                        17:10:06   Log-Likelihood:                -10794.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -933.3324    120.655     -7.736      0.000   -1169.813    -696.852
no_of_adults                             0.1060      0.037      2.841      0.004       0.033       0.179
no_of_children                           0.1542      0.057      2.694      0.007       0.042       0.266
no_of_weekend_nights                     0.1075      0.020      5.439      0.000       0.069       0.146
no_of_week_nights                        0.0405      0.012      3.295      0.001       0.016       0.065
required_car_parking_space              -1.5907      0.138    -11.538      0.000      -1.861      -1.320
lead_time                                0.0157      0.000     58.933      0.000       0.015       0.016
arrival_year                             0.4611      0.060      7.711      0.000       0.344       0.578
arrival_month                           -0.0411      0.006     -6.358      0.000      -0.054      -0.028
arrival_date                             0.0005      0.002      0.257      0.797      -0.003       0.004
repeated_guest                          -2.3140      0.618     -3.743      0.000      -3.526      -1.102
no_of_previous_cancellations             0.2633      0.086      3.074      0.002       0.095       0.431
no_of_previous_bookings_not_canceled    -0.1728      0.152     -1.136      0.256      -0.471       0.125
avg_price_per_room                       0.0187      0.001     25.374      0.000       0.017       0.020
no_of_special_requests                  -1.4709      0.030    -48.891      0.000      -1.530      -1.412
type_of_meal_plan_Meal Plan 2            0.1794      0.067      2.694      0.007       0.049       0.310
type_of_meal_plan_Meal Plan 3           19.8256   1.36e+04      0.001      0.999   -2.67e+04    2.67e+04
type_of_meal_plan_Not Selected           0.2745      0.053      5.181      0.000       0.171       0.378
room_type_reserved_Room_Type 2          -0.3640      0.131     -2.784      0.005      -0.620      -0.108
room_type_reserved_Room_Type 3          -0.0018      1.310     -0.001      0.999      -2.569       2.566
room_type_reserved_Room_Type 4          -0.2763      0.053     -5.207      0.000      -0.380      -0.172
room_type_reserved_Room_Type 5          -0.7182      0.209     -3.436      0.001      -1.128      -0.308
room_type_reserved_Room_Type 6          -0.9408      0.147     -6.402      0.000      -1.229      -0.653
room_type_reserved_Room_Type 7          -1.3891      0.293     -4.743      0.000      -1.963      -0.815
market_segment_type_Complementary      -47.7454   7.09e+06  -6.74e-06      1.000   -1.39e+07    1.39e+07
market_segment_type_Corporate           -0.8033      0.103     -7.807      0.000      -1.005      -0.602
market_segment_type_Offline             -1.7995      0.052    -34.577      0.000      -1.902      -1.698
========================================================================================================

Observations¶

  • Note that the summary reports converged: False; the extremely large standard errors for type_of_meal_plan_Meal Plan 3 and market_segment_type_Complementary suggest these sparse dummy categories are causing estimation issues, which removing them should resolve.
  • The following five features have p-values > .05 and thus can be removed:
    • arrival_date
    • no_of_previous_bookings_not_canceled
    • type_of_meal_plan_Meal Plan 3
    • room_type_reserved_Room_Type 3
    • market_segment_type_Complementary

Removing high p-value variables¶

  • We will do the following repeatedly using a loop:
    • Build a model, check the p-values of the variables, and drop the column with the highest p-value.
    • Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
    • Repeat the above two steps till there are no columns with p-value > 0.05.

Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

In [682]:
# initial list of columns
cols = X_train1.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    X_train_aux = X_train1[cols]

    # fitting the model
    model = sm.Logit(y_train, X_train_aux).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
In [683]:
# Let's create new X_train and X_test sets using only the selected features (they should all have p-values < .05)
X_train2 = X_train1[selected_features]
X_test2 = X_test1[selected_features]
In [684]:
# Review the training set feature set
X_train2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25392 entries, 13662 to 33003
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   const                           25392 non-null  float64
 1   no_of_adults                    25392 non-null  float64
 2   no_of_children                  25392 non-null  float64
 3   no_of_weekend_nights            25392 non-null  float64
 4   no_of_week_nights               25392 non-null  float64
 5   required_car_parking_space      25392 non-null  float64
 6   lead_time                       25392 non-null  float64
 7   arrival_year                    25392 non-null  float64
 8   arrival_month                   25392 non-null  float64
 9   repeated_guest                  25392 non-null  float64
 10  no_of_previous_cancellations    25392 non-null  float64
 11  avg_price_per_room              25392 non-null  float64
 12  no_of_special_requests          25392 non-null  float64
 13  type_of_meal_plan_Meal Plan 2   25392 non-null  float64
 14  type_of_meal_plan_Not Selected  25392 non-null  float64
 15  room_type_reserved_Room_Type 2  25392 non-null  float64
 16  room_type_reserved_Room_Type 4  25392 non-null  float64
 17  room_type_reserved_Room_Type 5  25392 non-null  float64
 18  room_type_reserved_Room_Type 6  25392 non-null  float64
 19  room_type_reserved_Room_Type 7  25392 non-null  float64
 20  market_segment_type_Corporate   25392 non-null  float64
 21  market_segment_type_Offline     25392 non-null  float64
dtypes: float64(22)
memory usage: 4.5 MB
In [685]:
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp=False)

print(lg2.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25370
Method:                           MLE   Df Model:                           21
Date:                Fri, 19 Apr 2024   Pseudo R-squ.:                  0.3283
Time:                        17:10:12   Log-Likelihood:                -10809.
converged:                       True   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                           -917.2860    120.456     -7.615      0.000   -1153.376    -681.196
no_of_adults                       0.1086      0.037      2.914      0.004       0.036       0.182
no_of_children                     0.1522      0.057      2.660      0.008       0.040       0.264
no_of_weekend_nights               0.1086      0.020      5.501      0.000       0.070       0.147
no_of_week_nights                  0.0418      0.012      3.403      0.001       0.018       0.066
required_car_parking_space        -1.5943      0.138    -11.561      0.000      -1.865      -1.324
lead_time                          0.0157      0.000     59.218      0.000       0.015       0.016
arrival_year                       0.4531      0.060      7.591      0.000       0.336       0.570
arrival_month                     -0.0424      0.006     -6.568      0.000      -0.055      -0.030
repeated_guest                    -2.7365      0.557     -4.915      0.000      -3.828      -1.645
no_of_previous_cancellations       0.2289      0.077      2.983      0.003       0.078       0.379
avg_price_per_room                 0.0192      0.001     26.343      0.000       0.018       0.021
no_of_special_requests            -1.4699      0.030    -48.892      0.000      -1.529      -1.411
type_of_meal_plan_Meal Plan 2      0.1654      0.067      2.487      0.013       0.035       0.296
type_of_meal_plan_Not Selected     0.2858      0.053      5.405      0.000       0.182       0.389
room_type_reserved_Room_Type 2    -0.3560      0.131     -2.725      0.006      -0.612      -0.100
room_type_reserved_Room_Type 4    -0.2826      0.053     -5.330      0.000      -0.387      -0.179
room_type_reserved_Room_Type 5    -0.7352      0.208     -3.529      0.000      -1.143      -0.327
room_type_reserved_Room_Type 6    -0.9650      0.147     -6.572      0.000      -1.253      -0.677
room_type_reserved_Room_Type 7    -1.4312      0.293     -4.892      0.000      -2.005      -0.858
market_segment_type_Corporate     -0.7928      0.103     -7.711      0.000      -0.994      -0.591
market_segment_type_Offline       -1.7867      0.052    -34.391      0.000      -1.889      -1.685
==================================================================================================
In [686]:
print("Training performance:")
model_performance_classification_statsmodels(lg2, X_train2, y_train)
Training performance:
Out[686]:
Accuracy Recall Precision F1
0 0.805411 0.632548 0.739033 0.681657

Observations¶

  • The new model performance metrics are still in-line with the previous model metrics.
  • Therefore, we will continue to use this newest model as the metrics are comparable and it's less complex due to the additional features dropped after the p-value analysis.

Converting coefficients to odds¶

  • The coefficients of the logistic regression model are in terms of log(odd), to find the odds we have to take the exponential of the coefficients.
  • Therefore, odds = exp(b)
  • The percentage change in odds is given as odds = (exp(b) - 1) * 100
In [687]:
# converting coefficients to odds
odds = np.exp(lg2.params)

# finding the percentage change
perc_change_odds = (np.exp(lg2.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train2.columns)
Out[687]:
                                    Odds  Change_odd%
const                           0.000000  -100.000000
no_of_adults                    1.114754    11.475363
no_of_children                  1.164360    16.436009
no_of_weekend_nights            1.114753    11.475256
no_of_week_nights               1.042636     4.263629
required_car_parking_space      0.203048   -79.695231
lead_time                       1.015835     1.583521
arrival_year                    1.573235    57.323511
arrival_month                   0.958528    -4.147245
repeated_guest                  0.064797   -93.520258
no_of_previous_cancellations    1.257157    25.715665
avg_price_per_room              1.019348     1.934790
no_of_special_requests          0.229941   -77.005947
type_of_meal_plan_Meal Plan 2   1.179916    17.991562
type_of_meal_plan_Not Selected  1.330892    33.089244
room_type_reserved_Room_Type 2  0.700461   -29.953888
room_type_reserved_Room_Type 4  0.753830   -24.617006
room_type_reserved_Room_Type 5  0.479403   -52.059666
room_type_reserved_Room_Type 6  0.380991   -61.900934
room_type_reserved_Room_Type 7  0.239033   -76.096691
market_segment_type_Corporate   0.452584   -54.741616
market_segment_type_Offline     0.167504   -83.249628

Coefficient interpretations¶

  • no_of_adults: Holding all other features constant, a 1 unit increase in no_of_adults will multiply the odds of the guest cancelling by ~1.11, i.e., a ~11.48% increase in the odds of cancelling.
  • no_of_previous_cancellations: Holding all other features constant, a 1 unit increase in no_of_previous_cancellations will multiply the odds of the guest cancelling by ~1.26, i.e., a ~25.72% increase in the odds of cancelling.
  • no_of_special_requests: Holding all other features constant, a 1 unit increase in no_of_special_requests will multiply the odds of the guest cancelling by ~0.23, i.e., a ~77.01% decrease in the odds of cancelling.
  • required_car_parking_space: Holding all other features constant, requiring a car parking space will multiply the odds of the guest cancelling by ~0.20, i.e., a ~79.7% decrease in the odds of cancelling.
  • repeated_guest: Holding all other features constant, being a repeated guest will multiply the odds of cancelling by ~0.06, i.e., a ~93.52% decrease in the odds of cancelling.

Interpretation for other attributes can be done similarly.
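As a quick sanity check, these values can be reproduced by hand. Using the rounded lead_time coefficient (~0.0157) from the summary above (the rounding explains the small difference from the table's 1.015835):

```python
import math

# Rounded lead_time coefficient from the fitted model summary above
coef_lead_time = 0.0157

odds = math.exp(coef_lead_time)  # odds multiplier per extra day of lead time
perc_change = (odds - 1) * 100   # percentage change in odds

print(round(odds, 4), round(perc_change, 2))  # → 1.0158 1.58
```

So each additional day of lead time multiplies the odds of cancellation by ~1.016, a ~1.58% increase, consistent with the table.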

Training Data Performance¶

In [688]:
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train)
In [689]:
log_reg_model_train_perf = model_performance_classification_statsmodels(
    lg2, X_train2, y_train
)

print("Training performance:")
log_reg_model_train_perf
Training performance:
Out[689]:
Accuracy Recall Precision F1
0 0.805411 0.632548 0.739033 0.681657

Test Data Performance¶

In [690]:
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test)
In [691]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg2, X_test2, y_test
)

print("Training performance:")
log_reg_model_test_perf
Training performance:
Out[691]:
Accuracy Recall Precision F1
0 0.804649 0.630892 0.729003 0.676408

Observations¶

  • The model is giving a decent f1_score of ~0.682 on the train set and ~0.676 on the test set
  • In the raw data, 67.2% of bookings were not cancelled, so a naive model that always predicts "no cancellation" would be only ~67% accurate; the current model (~80.5% accuracy) clearly improves on that baseline
  • As the train and test performances are comparable, the model is not overfitting
  • Moving forward we will try to improve the performance of the model

Model Improvements¶

  • Let's see if the f1_score can be improved further by changing the model threshold
  • First, we will check the ROC curve, compute the area under the ROC curve (ROC-AUC), and then use it to find the optimal threshold
  • Next, we will check the Precision-Recall curve to find the right balance between precision and recall as our metric of choice is f1_score

ROC Curve and ROC-AUC¶

  • ROC-AUC on training set
In [692]:
# Plot the False Positive Rate (FPR) vs True Positive Rate (TPR)
logit_roc_auc_train = roc_auc_score(y_train, lg2.predict(X_train2))
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
  • The Logistic Regression model is giving a good performance on the training set.

Optimal threshold using AUC-ROC curve¶

In [693]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))

# Find the optimal threshold as the point maximizing tpr - fpr (Youden's J statistic)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(f"The AUC-ROC optimal threshold is {optimal_threshold_auc_roc}")
The AUC-ROC optimal threshold is 0.3710466623490246

Checking model performance on training set using AUC-ROC optimal threshold¶

In [694]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
In [695]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
Out[695]:
Accuracy Recall Precision F1
0 0.792888 0.735621 0.668696 0.700564
  • The Recall and F1 of the model have both increased, while the other two metrics have decreased.
  • The model is still giving a good performance.

Checking model performance on test set¶

In [696]:
# Plot the False Positive Rate (FPR) vs True Positive Rate (TPR)
logit_roc_auc_test = roc_auc_score(y_test, lg2.predict(X_test2))
fpr, tpr, thresholds = roc_curve(y_test, lg2.predict(X_test2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
In [697]:
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc)
In [698]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
Out[698]:
Accuracy Recall Precision F1
0 0.796012 0.739353 0.666667 0.701131

The model performs similarly on both the training and test data sets.¶

Precision-Recall Curve¶

In [699]:
# Plot the Precision vs Recall intersecting line graph
y_scores = lg2.predict(X_train2)
prec, rec, tre = precision_recall_curve(y_train, y_scores)

plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

# Find the threshold where precision and recall are closest to each other
# (prec and rec have one more element than tre, so align them with tre)
# Once the balance point is found, set the optimal_threshold_curve value
optimal_idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
optimal_threshold_curve = tre[optimal_idx]

print(f"optimal_threshold_curve is {optimal_threshold_curve} ")
optimal_threshold_curve is 0.4209574614254219 

Checking model performance on training set¶

In [700]:
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train, threshold=optimal_threshold_curve)
In [701]:
# Calculate the model's performance metrics against the training data set
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg2, X_train2, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[701]:
Accuracy Recall Precision F1
0 0.801749 0.698912 0.699079 0.698995
  • The model is performing well on the training set.
  • There's not much improvement in the model performance, as the optimal threshold here (~0.421) is close to the default threshold of 0.50.

Checking model performance on test set¶

In [702]:
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_curve)
In [703]:
# Calculate the model's performance metrics against the test data set
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg2, X_test2, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
Out[703]:
Accuracy Recall Precision F1
0 0.804098 0.703010 0.695115 0.699040

The current model with the optimal threshold performs similarly on both the training and test data sets.¶

Model Performance Comparison and Final Model Selection¶

In [704]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[704]:
Logistic Regression-default Threshold (0.5) Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.805411 0.792888 0.801749
Recall 0.632548 0.735621 0.698912
Precision 0.739033 0.668696 0.699079
F1 0.681657 0.700564 0.698995
In [705]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[705]:
Logistic Regression-default Threshold (0.5) Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.804649 0.796012 0.804098
Recall 0.630892 0.739353 0.703010
Precision 0.729003 0.666667 0.695115
F1 0.676408 0.701131 0.699040
  • All three models are performing well on both the training and test data without overfitting
  • The model with the 0.37 threshold is giving the best F1 score, so it can be selected as the best logistic regression model

Final Logistic Regression Model Summary¶

  • We have been able to build a predictive model with an f1_score of ~0.701 on the training set that INN Hotels can use to predict which guests are likely to cancel their bookings, and to help formulate new cancellation and refund policies.

  • All the logistic regression models have given a generalized performance on the training and test set.

  • The coefficients of no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, lead_time, arrival_year, no_of_previous_cancellations, avg_price_per_room, type_of_meal_plan_Meal Plan 2, and type_of_meal_plan_Not Selected are positive; an increase in any of these increases the chances of a guest cancelling their booking.

  • The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests, room_type_reserved_Room_Type 2, room_type_reserved_Room_Type 4, room_type_reserved_Room_Type 5, room_type_reserved_Room_Type 6, room_type_reserved_Room_Type 7, market_segment_type_Corporate, and market_segment_type_Offline are negative; an increase in any of these decreases the chances of a guest cancelling their booking.

Building a Decision Tree model¶

In [706]:
# Since we went through detailed EDA on the original dataset, we will not go through it again.
# Decision Tree models are not impacted by multicollinearity, so we start from the original data sets.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36275 non-null  int64  
 1   no_of_children                        36275 non-null  int64  
 2   no_of_weekend_nights                  36275 non-null  int64  
 3   no_of_week_nights                     36275 non-null  int64  
 4   type_of_meal_plan                     36275 non-null  object 
 5   required_car_parking_space            36275 non-null  int64  
 6   room_type_reserved                    36275 non-null  object 
 7   lead_time                             36275 non-null  int64  
 8   arrival_year                          36275 non-null  int64  
 9   arrival_month                         36275 non-null  int64  
 10  arrival_date                          36275 non-null  int64  
 11  market_segment_type                   36275 non-null  object 
 12  repeated_guest                        36275 non-null  int64  
 13  no_of_previous_cancellations          36275 non-null  int64  
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64  
 17  booking_status                        36275 non-null  int64  
dtypes: float64(1), int64(14), object(3)
memory usage: 5.0+ MB

Data Preparation for Modeling¶

In [707]:
# specifying the independent  and dependent variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]

# adding a constant to the independent variables
X = sm.add_constant(X)

# creating dummy variables
X = pd.get_dummies(X, drop_first=True)

# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [708]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0   0.670644
1   0.329356
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.676376
1   0.323624
Name: booking_status, dtype: float64
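The class proportions above are close but not identical between the two splits. Passing `stratify=Y` to `train_test_split` would guarantee matching proportions; a minimal sketch on synthetic labels (the arrays below are illustrative stand-ins, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (~33% positives, mirroring booking_status)
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.33).astype(int)
X = rng.random((1000, 3))

# stratify=y forces both splits to keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```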

Decision Tree (default)¶

In [709]:
# Create a Decision Tree Classifier with random_state=1 for reproducibility
model0 = DecisionTreeClassifier(random_state=1)

# Fit the model using the training data
model0.fit(X_train, y_train)
Out[709]:
DecisionTreeClassifier(random_state=1)
In [710]:
# Let's display the confusion matrix for the default decision tree model using training data.
confusion_matrix_sklearn(model0, X_train, y_train)
In [711]:
# Display the Decision Tree default model performance metrics for the Training Data
decision_tree_perf_train_without = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_perf_train_without
Out[711]:
Accuracy Recall Precision F1
0 0.994211 0.986608 0.995776 0.991171
In [712]:
# Display the Decision Tree default model performance metrics for the Test Data
decision_tree_perf_test_without = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_perf_test_without
Out[712]:
Accuracy Recall Precision F1
0 0.874299 0.814026 0.800838 0.807378

Observations¶

  • All four metrics on the training data are very close to 1.0 (the maximum value), which indicates overfitting.
  • Running the default model against the test data shows significantly lower performance metrics, which confirms that we should prune the Decision Tree.

Do we need to prune the tree?¶

Yes, based on the performance metrics for both the Training and Test data sets (see above), the Decision Tree should be pruned.
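To put a number on the overfitting, we can look at the gap between the train and test F1 scores reported above. The 0.05 threshold in this sketch is only an illustrative rule of thumb, not a fixed standard:

```python
# F1 scores reported above for the default (unpruned) tree
f1_train_default, f1_test_default = 0.991171, 0.807378

# The train-test gap is a simple overfitting signal; 0.05 is an
# illustrative rule-of-thumb threshold, not a fixed standard.
gap = f1_train_default - f1_test_default
print(f"F1 gap: {gap:.3f}")
```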

Before pruning the tree let's check the important features.¶

In [713]:
# Plot the important features using a bar chart
feature_names = list(X_train.columns)
importances = model0.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Pruning the tree¶

Pre-Pruning¶

In [714]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
# Use f1_score since it's important to reduce both false positives and false negatives.
acc_scorer = make_scorer(f1_score)

# Run the grid search
# The GridSearchCV runs through all combinations of the parameters that can then be used
# to select the best estimator.
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[714]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)

Checking performance on training set¶

In [715]:
# Create the confusion matrix using Training data.
confusion_matrix_sklearn(estimator, X_train, y_train)
In [716]:
# Create the new model's performance metrics against the training data.
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[716]:
Accuracy Recall Precision F1
0 0.831010 0.786201 0.724278 0.753971

Checking the performance metrics for the Test Data¶

In [717]:
# Create the confusion matrix against the test data.
confusion_matrix_sklearn(estimator, X_test, y_test)
In [718]:
# create the new model's performance metrics against the test data.
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[718]:
Accuracy Recall Precision F1
0 0.834972 0.783362 0.727584 0.754444
  • The model now gives a generalized result: the F1 scores on both the train and test data are around 0.75, which shows that the model generalizes well to unseen data.
In [719]:
# Plot the important features in a bar plot.
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In [720]:
# Create a tree visualization graph of the Decision Tree model
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [721]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 132.08] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

Observations from the pre-pruned tree¶

Using the decision rules extracted above, we can interpret the decision tree model. For example:

  • If lead_time <= 151.50, no_of_special_requests <= 0.50, market_segment_type_Online <= 0.50, lead_time <= 90.50, no_of_weekend_nights <= 0.50, and avg_price_per_room > 196.50, then the guest is likely to cancel the booking.

Interpretations from other decision rules can be made similarly.
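Each extracted rule is just a conjunction of threshold tests, so it can be applied to a DataFrame as a boolean mask to see how many bookings it covers. A minimal sketch on synthetic stand-in data (the column names match the real dataset, but the values and the simplified rule are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the booking data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {
        "lead_time": rng.integers(0, 400, 500),
        "no_of_special_requests": rng.integers(0, 4, 500),
        "avg_price_per_room": rng.uniform(40, 250, 500),
    }
)

# A simplified version of one leaf rule, expressed as a boolean mask
rule = (
    (df["lead_time"] <= 151.5)
    & (df["no_of_special_requests"] <= 0.5)
    & (df["avg_price_per_room"] > 196.5)
)
print(f"{rule.mean():.1%} of rows match this rule")
```

On the real data, comparing `booking_status` within and outside such a mask would verify the leaf's class label empirically.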

In [722]:
# Importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)

# Plot the most important features
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

  • In the pre-pruned decision tree, the most important features are: lead_time, market_segment_type_Online, no_of_special_requests, and avg_price_per_room.

Decision Tree (Post pruning)¶

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree¶

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
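As a quick sanity check of the API, `cost_complexity_pruning_path` can be exercised on a toy problem; this sketch uses a synthetic dataset rather than the hotel data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy classification problem standing in for the hotel data
X_demo, y_demo = make_classification(n_samples=500, random_state=1)

# The path is computed directly from X and y; no prior fit is required
path_demo = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_demo, y_demo
)

# Alphas come back sorted ascending; total leaf impurity rises with alpha
print(len(path_demo.ccp_alphas), "candidate alphas")
```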

In [723]:
# Initialize the Decision Tree Classifier with random_state=1 for reproducibility
# and class_weight="balanced" to balance the influence of the classes during training
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Calculate the pruning path for the decision tree classifier using cost complexity pruning. 
path = clf.cost_complexity_pruning_path(X_train, y_train)

#Extract the alpha values along with their associated impurities
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [724]:
# Convert to a pandas DataFrame and show the first 10 rows to verify.
# Path contains the values of the alpha (the complexity parameter) and the corresponding impurities 
# for different pruning levels.
pd.DataFrame(path).head(10)
Out[724]:
ccp_alphas impurities
0 0.000000 0.008376
1 0.000000 0.008376
2 0.000000 0.008376
3 0.000000 0.008376
4 0.000000 0.008376
5 0.000000 0.008376
6 0.000000 0.008376
7 0.000000 0.008376
8 0.000000 0.008376
9 0.000000 0.008376
In [725]:
# Convert to a pandas DataFrame and show the last 10 rows to verify.
# Path contains the values of the alpha (the complexity parameter) and the corresponding impurities 
# for different pruning levels.
pd.DataFrame(path).tail(10)
Out[725]:
ccp_alphas impurities
1834 0.002967 0.296306
1835 0.003095 0.299401
1836 0.003936 0.303338
1837 0.004547 0.307885
1838 0.005636 0.319156
1839 0.008902 0.328058
1840 0.009802 0.337860
1841 0.012719 0.350579
1842 0.034121 0.418821
1843 0.081179 0.500000
In [726]:
# Plot effective alphas vs total impurity of leaves
# Remove the last alpha/impurities node which corresponds to a fully pruned tree.
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
  • As alpha increases, the leaf impurity increases. At the largest alpha value, the tree is pruned down to a single, maximally impure node.
  • Next, we train a decision tree for each of the effective alphas. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the tree, clfs[-1], with one node.
In [727]:
# clfs will be used to store the decision tree classifiers trained for different alpha values.
# Initialize to an empty list before we start loading
clfs = []

# Loop through each alpha value and create a Decision Tree Classifier for that particular alpha value
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    
    #Use the decision tree classifier to train the model
    clf.fit(X_train, y_train)
    
    #Store the trained decision tree classifier
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [728]:
# Remove the last element that represents the fully pruned tree (since it doesn't add any value)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Create a node_count list that contains the number of nodes for each classifier
node_counts = [clf.tree_.node_count for clf in clfs]

# Get the max depth value for each classifier and store in the depth list.
depth = [clf.tree_.max_depth for clf in clfs]

# Plot the number of nodes vs alphas
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")

# Plot the maximum depth of tree vs alphas
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
  • As alpha increases the number of nodes drops dramatically.
  • As alpha increases the maximum depth of the tree also drops significantly.
In [729]:
# For each decision tree classifier predict the training values and then calculate and store the F1 score.
# f1_train will contain the list of f1_scores for each decision tree classifer trained.
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
In [730]:
# For each decision tree classifier predict the testing values and then calculate and store the F1 score.
# f1_test will contain the list of f1_scores for each decision tree classifer on test data.
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [731]:
# Plot the alpha vs F1 Scores for both Training and Test data sets
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
  • For very small alpha values, the blue curve (training set) has a very high F1 score (starting at 1.0), indicating overfitting.
  • The alpha that maximizes the F1 score on the test data is likely the best choice for performance.
    • Let's calculate this value programmatically next, since the plot is only a high-level visual.
In [732]:
# Creating the model where we get highest test f1_score
index_best_model = np.argmax(f1_test)

# Get the best_model from the highest test f1_score.
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167002,
                       class_weight='balanced', random_state=1)

Using the best_model, check the performance on training set¶

In [733]:
# Create a confusion matrix of the best_model against the training data.
confusion_matrix_sklearn(best_model, X_train, y_train)
In [734]:
# Calculate the best_model's performance metrics against the training data.
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[734]:
Accuracy Recall Precision F1
0 0.899575 0.903145 0.812762 0.855573

Using the best_model, check the performance on test set¶

In [735]:
# Create the confusion matrix for the best_model against the test data.
confusion_matrix_sklearn(best_model, X_test, y_test)
In [736]:
# Calculate the best_model's performance metrics against the test data.
decision_tree_post_perf_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_perf_test
Out[736]:
Accuracy Recall Precision F1
0 0.868419 0.855764 0.765363 0.808043
  • The post-pruned tree also gives a generalized result: the F1 scores on the train and test data are about 0.856 and 0.808 respectively, which shows that the model generalizes reasonably well to unseen data.
In [737]:
# Plot the tree structure of the best_model
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# Draw the arrows
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [738]:
# Text report showing the rules of a decision tree of the best_model
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |--- lead_time <= 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 68.50
|   |   |   |   |   |   |   |   |   |--- weights: [207.26, 10.63] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  68.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |--- arrival_date >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |--- lead_time >  16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 135.00
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  135.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |--- weights: [1199.59, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 63.29
|   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [41.75, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 59.75
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [14.91, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  59.75
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 59.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  63.29
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [20.13, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |--- weights: [413.04, 27.33] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- avg_price_per_room <= 99.98
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 62.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  62.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [8.20, 25.81] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [55.17, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- lead_time >  73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  99.98
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 122.97] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  132.43
|   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 58.75
|   |   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  58.75
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.96, 9.11] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  75.07
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [59.64, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 16.70] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [44.73, 4.55] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [16.40, 39.47] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |--- weights: [20.13, 6.07] class: 0
|   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 102.09
|   |   |   |   |   |   |   |   |--- weights: [5.22, 144.22] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  102.09
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 109.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [33.55, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  109.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.98, 75.91] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 3.04] class: 0
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |--- weights: [38.02, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  65.38
|   |   |   |   |   |   |   |   |   |--- weights: [24.60, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |   |   |--- arrival_date <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [14.91, 72.87] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 1.52] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |--- weights: [84.25, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |--- lead_time <= 125.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.85
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.42, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.85
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  125.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [58.15, 18.22] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- weights: [61.88, 1.52] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 70.05
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  70.05
|   |   |   |   |   |   |   |   |   |--- lead_time <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time >  5.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [34.30, 40.99] class: 1
|   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- lead_time >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 10.63] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [155.07, 6.07] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 10.63] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.67
|   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- weights: [63.37, 30.36] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [115.56, 12.14] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [28.33, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.67
|   |   |   |   |   |   |   |--- weights: [0.75, 22.77] class: 1
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 119.25
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 118.50
|   |   |   |   |   |   |   |   |   |--- weights: [18.64, 59.21] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  118.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 1.52] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  119.25
|   |   |   |   |   |   |   |   |--- weights: [34.30, 171.55] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [26.09, 1.52] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 14.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 36.43] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  14.00
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 208.67
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  208.67
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- avg_price_per_room <= 59.43
|   |   |   |   |   |   |   |--- lead_time <= 84.50
|   |   |   |   |   |   |   |   |--- weights: [50.70, 7.59] class: 0
|   |   |   |   |   |   |   |--- lead_time >  84.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  59.43
|   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |--- weights: [20.88, 6.07] class: 0
|   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.34
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [15.66, 78.94] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 102.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  102.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.34
|   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 120.45
|   |   |   |   |   |   |   |   |   |--- weights: [79.77, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  120.45
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 12.14] class: 1
|   |   |   |   |   |   |   |--- lead_time >  65.50
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 47.06] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 63.76] class: 1
|   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 104.31
|   |   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [39.51, 185.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [73.81, 411.41] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  104.31
|   |   |   |   |   |   |   |   |--- arrival_month <= 10.50
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 195.30
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 9
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  195.30
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 138.15] class: 1
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  10.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 168.06
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 22.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  22.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [17.15, 83.50] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  168.06
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 6.07] class: 0
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- lead_time <= 63.00
|   |   |   |   |   |   |   |--- weights: [15.66, 1.52] class: 0
|   |   |   |   |   |   |--- lead_time >  63.00
|   |   |   |   |   |   |   |--- weights: [0.00, 7.59] class: 1
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [31.31, 13.66] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [0.75, 6.07] class: 1
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- no_of_week_nights <= 10.00
|   |   |   |   |   |   |   |--- weights: [498.03, 40.99] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  10.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- arrival_date <= 13.50
|   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |--- weights: [58.90, 36.43] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |--- weights: [33.55, 1.52] class: 0
|   |   |   |   |   |   |--- arrival_date >  13.50
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [123.76, 9.11] class: 0
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 126.33
|   |   |   |   |   |   |   |   |   |--- weights: [32.80, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  126.33
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 13.66] class: 1
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- avg_price_per_room <= 118.55
|   |   |   |   |   |   |   |--- lead_time <= 61.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [70.08, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 11
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [126.74, 1.52] class: 0
|   |   |   |   |   |   |   |--- lead_time >  61.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 66.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  66.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.93
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [54.43, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.93
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |--- avg_price_per_room >  118.55
|   |   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 121.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [18.64, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  121.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [11.93, 10.63] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [37.28, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 119.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 28.84] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  119.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 12
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [49.95, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 18.22] class: 1
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- no_of_week_nights <= 9.50
|   |   |   |   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |   |   |   |--- weights: [32.06, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  6.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  5.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 93.09
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  93.09
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [77.54, 27.33] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [19.38, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  9.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.95
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- lead_time <= 150.50
|   |   |   |   |   |   |   |   |   |--- weights: [175.20, 28.84] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  150.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.95
|   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 153.15
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 <= 0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.12
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.12
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.42
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 7.59] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.42
|   |   |   |   |   |   |   |   |   |   |--- weights: [64.12, 60.72] class: 0
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  153.15
|   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- lead_time <= 160.50
|   |   |   |   |   |   |   |--- weights: [2.98, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  160.50
|   |   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- lead_time <= 341.00
|   |   |   |   |   |   |   |--- lead_time <= 173.00
|   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [46.97, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  173.00
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [188.62, 7.59] class: 0
|   |   |   |   |   |   |--- lead_time >  341.00
|   |   |   |   |   |   |   |--- weights: [13.42, 27.33] class: 1
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- lead_time <= 285.50
|   |   |   |   |   |   |   |--- weights: [8.20, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  285.50
|   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |--- lead_time <= 244.00
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.89, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [75.30, 12.14] class: 0
|   |   |   |   |   |   |   |--- lead_time >  244.00
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [25.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 264.15] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |--- weights: [46.22, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- lead_time <= 324.50
|   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 986.78] class: 1
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 >  0.50
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |--- lead_time >  324.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 89.00
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  89.00
|   |   |   |   |   |   |   |   |--- weights: [0.75, 13.66] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [1.49, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- arrival_date <= 1.50
|   |   |   |   |   |   |   |--- weights: [1.49, 3.04] class: 1
|   |   |   |   |   |   |--- arrival_date >  1.50
|   |   |   |   |   |   |   |--- weights: [35.79, 1.52] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 96.37
|   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  96.37
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |   |   |--- weights: [7.46, 206.46] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- avg_price_per_room <= 76.48
|   |   |   |   |   |   |   |--- weights: [46.97, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  76.48
|   |   |   |   |   |   |   |--- no_of_week_nights <= 6.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 233.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 152.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  152.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  233.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 19.74] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 269.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  269.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  6.50
|   |   |   |   |   |   |   |   |--- weights: [4.47, 13.66] class: 1
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- arrival_date <= 14.50
|   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |--- arrival_date >  14.50
|   |   |   |   |   |   |   |--- weights: [11.18, 31.88] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

In [739]:
# Plot the important features
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations on the post-pruning important features¶

  • In the post-pruned model, the most important features are lead_time, market_segment_type_Online, avg_price_per_room, no_of_special_requests, and arrival_month.
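The ranking in the plot above comes from pairing `feature_importances_` with the column names and sorting. A minimal sketch of that step, using placeholder importance values (the feature names match the model's top features, but the numbers are illustrative, not the fitted values):

```python
import numpy as np
import pandas as pd

# Placeholder importances standing in for best_model.feature_importances_
feature_names = [
    "lead_time", "market_segment_type_Online", "avg_price_per_room",
    "no_of_special_requests", "arrival_month", "no_of_adults",
]
importances = np.array([0.42, 0.18, 0.15, 0.12, 0.08, 0.05])

# Pair importances with names, sort descending, and keep the top 5
top5 = (
    pd.Series(importances, index=feature_names)
    .sort_values(ascending=False)
    .head(5)
)
print(top5)
```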

Decision Tree Model Performance¶

Decision Tree Model Performance (Training set)¶

In [740]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[740]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.994211 0.831010 0.899575
Recall 0.986608 0.786201 0.903145
Precision 0.995776 0.724278 0.812762
F1 0.991171 0.753971 0.855573

Decision Tree Model Performance (Test set)¶

In [741]:
# Test performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[741]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.874299 0.834972 0.868419
Recall 0.814026 0.783362 0.855764
Precision 0.800838 0.727584 0.765363
F1 0.807378 0.754444 0.808043

Overall observations on all models (Default, Pre-Pruning, and Post-Pruning)¶

  • The default Decision Tree (sklearn) is overfitting the training data (training F1 of 0.99 vs. test F1 of 0.81), so it is excluded from the best-model decision.
    • Only the Pre-Pruning and Post-Pruning models are compared.
  • The Decision Tree (Post-Pruning) provides a higher F1 score than the Decision Tree (Pre-Pruning) on both the training and test sets.
  • Additionally, the Post-Pruning model consistently outperforms the Pre-Pruning model on the other performance metrics.
  • Therefore, the Decision Tree (Post-Pruning) model is selected as the best Decision Tree model.
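The selection above comes down to excluding the overfit default tree and keeping the candidate with the highest test-set F1. A minimal sketch using the test F1 values from the comparison table:

```python
import pandas as pd

# Test-set F1 scores taken from the comparison table above
test_f1 = pd.Series({
    "Decision Tree sklearn": 0.807378,
    "Decision Tree (Pre-Pruning)": 0.754444,
    "Decision Tree (Post-Pruning)": 0.808043,
})

# Exclude the default sklearn tree (overfitting), then keep the best F1
candidates = test_f1.drop("Decision Tree sklearn")
best = candidates.idxmax()
print(best)  # Decision Tree (Post-Pruning)
```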

Model Performance Comparison and Conclusions¶

  • In this project, the Logistic Regression model (Threshold=0.37) was chosen as the best Logistic Regression model, and the Decision Tree (Post-Pruning) was selected as the best Decision Tree model.
  • Let's now compare the best Logistic Regression model (Threshold=0.37) with the best Decision Tree (Post-Pruning) model.
In [742]:
# Test performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf_threshold_auc_roc.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-0.37 Threshold",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[742]:
Logistic Regression-0.37 Threshold Decision Tree (Post-Pruning)
Accuracy 0.796012 0.868419
Recall 0.739353 0.855764
Precision 0.666667 0.765363
F1 0.701131 0.808043

Observations on the overall best model¶

  • The Decision Tree (Post-Pruning) has a much higher F1 score than the Logistic Regression model (Threshold=0.37).
  • The Decision Tree (Post-Pruning) also has much higher values in the remaining performance metrics.
  • Therefore, the Decision Tree (Post-Pruning) is the overall winner and is selected as the final model for INN Hotels.

Best Overall Model: Decision Tree (Post-Pruning).¶

Actionable Insights and Recommendations¶

  • What profitable policies for cancellations and refunds can the hotel adopt?
  • What other recommendations would you suggest to the hotel?

The top five (5) influential features of the Decision Tree (Post-Pruning) model are as follows:¶

- lead_time
- market_segment_type_Online
- avg_price_per_room
- no_of_special_requests
- arrival_month

What profitable policies for cancellations and refunds can the hotel adopt?¶

  • INN Hotels should update cancellation and refund policies with the following recommendations:
    • Allow more flexible cancellation policies (up to a certain number of days before the arrival date) or provide ways for guests to modify their booking details, such as booking dates, meal plans, or room types. This would incentivize guests to book early while reducing the risk of cancellations.
    • Update the 100% refund cutoff (e.g., 1-2 weeks prior to the arrival date) so that INN Hotels has enough time to rebook the room or is in a position to charge the customer a cancellation fee. The cancellation fee can be dynamic as well. For example, if the customer cancels 2 weeks before arrival, they pay a 25% cancellation fee; if they cancel 1 week prior to the arrival date, they pay a 50% fee; and if they cancel 1-2 days prior to arrival, they pay a 100% fee.
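The tiered fee schedule above can be sketched as a simple lookup on days before arrival. The tier boundaries follow the example policy; the exact cutoffs would be a business decision for INN Hotels:

```python
def cancellation_fee_pct(days_before_arrival: int) -> int:
    """Return the cancellation fee as a percentage of the booking price.

    Tiers follow the example policy: free beyond 14 days, 25% at
    8-14 days, 50% at 3-7 days, and 100% within 2 days of arrival.
    """
    if days_before_arrival > 14:
        return 0
    if days_before_arrival >= 8:
        return 25
    if days_before_arrival >= 3:
        return 50
    return 100

print(cancellation_fee_pct(20))  # 0
print(cancellation_fee_pct(10))  # 25
print(cancellation_fee_pct(5))   # 50
print(cancellation_fee_pct(1))   # 100
```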

What other recommendations would you suggest to the hotel?¶

  • Dynamic Pricing Strategy: Recommend that INN Hotels implement dynamic pricing strategies that adjust room rates based on lead time and demand fluctuations. Offer cheaper prices for rooms that are booked with long lead times (greater than 120 days). As the arrival date approaches and the likelihood of cancellation decreases, rates can be adjusted upwards to maximize revenue.
  • Targeted Marketing: Recommend that INN Hotels implement a targeted marketing program that offers incentives for long-lead-time bookings to encourage commitment and reduce the likelihood of cancellation. Offer exclusive promotions or perks for early bookings such as room upgrades, meal plan upgrades, complimentary special requests, or discounts on ancillary services. By incentivizing early bookings, INN Hotels can increase revenue while decreasing the risk of cancellations.
  • Additional Data Capture: Recommend that INN Hotels capture additional data to support updated cancellation and refund policies with further quantitative analysis.
    • Capture data on when guests typically cancel before the arrival date, and on how quickly INN Hotels is able to rebook canceled rooms.
    • Gather additional data on the various room types (price, smoking preference, accessibility, etc.). This additional data may give the hotel insights on how to improve its offerings and reduce cancellations.
  • Loyalty Program: If not already established, INN Hotels should add a loyalty program to encourage repeat guests.
  • Data-Driven Decision Making: Continuously monitor and analyze booking information and, using the Decision Tree (Post-Pruning) model, predict whether a guest will cancel, so that appropriate overbooking can be applied to minimize revenue loss.
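The dynamic pricing recommendation can be sketched as a lead-time-based rate adjustment. The 120-day boundary follows the recommendation above; the other cutoff and the multipliers are illustrative assumptions, not values derived from the data:

```python
def adjusted_rate(base_rate: float, lead_time_days: int) -> float:
    """Adjust the nightly rate by lead time: discount far-out bookings,
    and raise the rate as arrival approaches and cancellation risk drops."""
    if lead_time_days > 120:   # long lead time: discount to lock in bookings
        multiplier = 0.85
    elif lead_time_days > 30:  # standard rate
        multiplier = 1.00
    else:                      # near arrival: premium pricing
        multiplier = 1.10
    return round(base_rate * multiplier, 2)

print(adjusted_rate(100.0, 150))  # 85.0
print(adjusted_rate(100.0, 60))   # 100.0
print(adjusted_rate(100.0, 10))   # 110.0
```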
In [ ]: